
feat(eval): add PersonaMem MCQ benchmark + full v1-32k raw eval (69.4%) #40

Merged: ch-liuzhide merged 1 commit into main from eval/personamem-mcq on Jun 18, 2026

@ch-liuzhide

Collaborator

Summary

PersonaMem was the last registered benchmark with no report. It is a 4-option multiple-choice personalization benchmark, but the wired-up bench inherited free-text QA + LLM judge that never showed the model the options. This PR rewires it to exact-match MCQ letter accuracy (no LLM judge) — the dataset authors' own protocol ("No LLM judges are involved") — and runs the full v1-32k split.

Fixes that made it runnable & faithful

  • Adapter (eval/datasets/personamem.py): ast.literal_eval fallback for Python-repr all_options (recovers the 303 of 589 rows that json.loads could not parse and that were silently left with 0 options); one scenario per (context, end_index) cut point, so each question's haystack is exactly turns[:end_index], with no future-turn or cross-persona leakage; carries options + gold letter in metadata.
  • Judge (eval/judge.py): select_choice() MCQ reader (temp 0, presents the 4 options).
  • Bench (eval/benchmarks/personamem_bench.py): per-cut-point partition ingest (mem_ prefix; partition ids must match ^mem_[a-z0-9_]+$), partition-scoped retrieval, letter exact-match, per-question-type accuracy; ingest + search retry, so a transient timeout fails only one question (counted in error_rate) instead of aborting all 589.
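
The adapter's parsing fallback can be illustrated with a minimal sketch. This is not the code from eval/datasets/personamem.py; the function name `parse_options` and the sample rows are hypothetical, and only the technique (try `json.loads`, then fall back to `ast.literal_eval` for Python-repr strings) comes from the description above.

```python
import ast
import json


def parse_options(raw):
    """Parse a serialized option list: JSON first, Python repr as fallback.

    Hypothetical helper for illustration; returns [] when nothing parses.
    """
    if isinstance(raw, list):
        return [str(o) for o in raw]
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Python-repr lists use single quotes, which JSON rejects.
        try:
            parsed = ast.literal_eval(raw)
        except (ValueError, SyntaxError, TypeError):
            return []
    if isinstance(parsed, (list, tuple)):
        return [str(o) for o in parsed]
    return []


# Made-up rows: one valid JSON, one Python repr that json.loads cannot read.
json_row = '["(a) tea", "(b) coffee", "(c) water", "(d) juice"]'
repr_row = "['(a) tea', \"(b) it's coffee\", '(c) water', '(d) juice']"
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for untrusted dataset fields.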

Result: eval/reports/personamem/v1/run-1/ (589 q, shipped default = all-MiniLM-384 + bge-reranker-base rerank, top-10, DeepSeek-V4-Pro reader temp 0):

| Metric | Value |
| --- | --- |
| MCQ accuracy | 69.4% (409/589) |
| answered-only (excl. 8 infra timeouts) | 70.4% |
| valid_choice_rate | 98.6% |

Above the 25% random baseline and the ~50–52% full-context frontier oracle (arXiv:2504.14225) — while reading only top-10 retrieved memories. Strongest on recalling why a preference changed (88.9%), weakest on generative suggest new ideas (39.8%), matching the paper's difficulty curve.
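
The reported percentages can be re-derived from the raw counts. The sketch below assumes the 8 infra timeouts are the only unanswered questions, so the answered-only denominator is 581; reading valid_choice_rate as 581/589 is likewise an assumption that happens to match the reported 98.6%, not something the report states.

```python
total = 589          # questions in the v1-32k split
correct = 409        # exact-match letter hits
infra_timeouts = 8   # questions lost to infrastructure timeouts

answered = total - infra_timeouts       # assumed: 581 answered questions
mcq_accuracy = correct / total          # timeouts count as wrong
answered_only = correct / answered      # timeouts excluded
valid_rate = answered / total           # assumed definition of valid_choice_rate
chance = 1 / 4                          # 4-option MCQ random baseline

print(f"MCQ accuracy:  {mcq_accuracy:.1%}")   # 69.4%
print(f"answered-only: {answered_only:.1%}")  # 70.4%
print(f"valid choices: {valid_rate:.1%}")     # 98.6%
print(f"chance:        {chance:.0%}")         # 25%
```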

Public pages (EN + zh/ mirror) rewritten from the stale free-text "67.6% QA (37q)" framing to correct MCQ framing with the right anchors; eval/README.md table row fixed (was "QA judge / —").

A 3-lens adversarial verification (correctness / methodology / isolation) confirmed the metric is faithful to the official protocol and the isolation is leak-free.

Test plan

  • All 31 eval tests pass (pytest tests/eval tests/test_eval)
  • Adapter loads 589 questions across 222 scenarios; every question has 4 options + gold letter present
  • Full 589-q run completes; report + per-category breakdown render
  • (optional, not in this PR) same-reader full-context control to isolate retrieval's contribution
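
The relationship between questions and scenarios checked in the test plan (589 questions across 222 scenarios) follows from grouping by cut point. A minimal sketch of that grouping, assuming hypothetical field names (`context_id`, `end_index`) and a hypothetical `build_scenarios` helper rather than the adapter's real schema:

```python
from collections import defaultdict


def build_scenarios(rows, contexts):
    """Group questions by (context_id, end_index) and slice each haystack.

    rows: question dicts with hypothetical keys "context_id" and "end_index".
    contexts: mapping of context_id to the full list of turns.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["context_id"], row["end_index"])].append(row)

    scenarios = []
    for (context_id, end_index), questions in grouped.items():
        scenarios.append({
            "context_id": context_id,
            "end_index": end_index,
            # Only turns before the cut point are visible: no future turns,
            # and no turns from any other context.
            "haystack": contexts[context_id][:end_index],
            "questions": questions,
        })
    return scenarios
```

Questions that share a cut point share one haystack, which is why the scenario count is smaller than the question count.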

Note (separate issue)

The ^mem_[a-z0-9_]+$ partition validation also rejects MemBench's membench_… partition ids, so a MemBench re-run today would 404 on ingest (its existing reports predate the validation). Not fixed here.
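
The rejection is easy to reproduce with the pattern alone. In this sketch both sample ids are made up for illustration (the real MemBench ids are not shown in this PR); the point is only that any id starting with `membench_` lacks the literal `mem_` prefix.

```python
import re

# Pattern quoted in the PR; fullmatch is used so a trailing newline cannot slip through.
PARTITION_ID_RE = re.compile(r"^mem_[a-z0-9_]+$")


def is_valid_partition_id(partition_id):
    """Return True when the id satisfies the partition validation pattern."""
    return PARTITION_ID_RE.fullmatch(partition_id) is not None


# Hypothetical ids, for illustration only.
print(is_valid_partition_id("mem_personamem_demo"))  # True
print(is_valid_partition_id("membench_demo"))        # False: 4th char is "b", not "_"
```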

🤖 Generated with Claude Code

PersonaMem was the last registered benchmark with no report. It is a
4-option multiple-choice personalization benchmark, but was wired to
free-text QA + LLM judge that never showed the model the options.
Rewire it to exact-match MCQ letter accuracy (no LLM judge), matching
the dataset authors' protocol.

- adapter: ast.literal_eval fallback for Python-repr `all_options`
  (recovers 303/589 rows json.loads silently dropped to 0 options);
  one scenario per (context, end_index) cut point so each question's
  haystack is exactly turns[:end_index] -- no future/cross-persona leak;
  carry options + gold letter in metadata.
- judge: select_choice() MCQ reader (temp 0, presents options).
- bench: per-cut-point partition ingest (mem_ prefix; partition ids must
  match ^mem_[a-z0-9_]+$), partition-scoped retrieval, letter exact-match,
  per-question-type accuracy; ingest + search retry so a transient timeout
  scores one question (counted in error_rate) instead of aborting the run.

Full v1-32k raw run (589 q, all-MiniLM-384 + bge-reranker-base, top-10,
DeepSeek-V4-Pro reader): MCQ accuracy 69.4% (409/589), above the ~50-52%
full-context frontier oracle and the 25% chance baseline. Report in
eval/reports/personamem/v1/run-1/; public pages (EN + zh) rewritten from
the stale free-text framing; README metric/version fixed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@gemini-code-assist (bot) left a comment

Code Review

This pull request implements the PersonaMem multiple-choice personalization benchmark, including its runner, dataset adapter, and an MCQ prompt/reader method. It also updates the documentation and adds evaluation reports showing a 69.4% accuracy. The review feedback suggests logging exceptions in _ensure_partition instead of swallowing them silently, and recommends replacing fragile manual string slicing with the robust _PAREN_LETTER_RE regex when parsing option letters.

Comment on lines +66 to +71
async def _ensure_partition(client: HebbClient, partition_id: str) -> None:
    """Create the partition (idempotent — swallow already-exists errors)."""
    try:
        await client.create_partition(partition_id, name=partition_id)
    except Exception:
        pass

medium

Swallowing all exceptions silently in _ensure_partition can hide genuine issues like network failures, authentication errors, or configuration problems, making them hard to debug. It is highly recommended to at least log the exception at a DEBUG or WARNING level so that unexpected failures are visible in the logs.

Suggested change
- async def _ensure_partition(client: HebbClient, partition_id: str) -> None:
-     """Create the partition (idempotent — swallow already-exists errors)."""
-     try:
-         await client.create_partition(partition_id, name=partition_id)
-     except Exception:
-         pass
+ async def _ensure_partition(client: HebbClient, partition_id: str) -> None:
+     """Create the partition (idempotent — swallow already-exists errors)."""
+     try:
+         await client.create_partition(partition_id, name=partition_id)
+     except Exception as e:
+         logger.debug("Failed to ensure partition %s (it may already exist): %s", partition_id, e)
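
The behaviour of the suggested change can be tried in isolation. In the sketch below `FakeClient` is a hypothetical stand-in (the real HebbClient API is not shown in this PR beyond `create_partition`), and raising `RuntimeError` on a duplicate id is an assumption made only so the second call has something to log.

```python
import asyncio
import logging

logger = logging.getLogger("personamem_bench_sketch")


class FakeClient:
    """Hypothetical stand-in for the real client; rejects duplicate ids."""

    def __init__(self):
        self.partitions = set()

    async def create_partition(self, partition_id, name):
        if partition_id in self.partitions:
            raise RuntimeError(f"partition {partition_id} already exists")
        self.partitions.add(partition_id)


async def ensure_partition(client, partition_id):
    """Create the partition; log instead of silently swallowing failures."""
    try:
        await client.create_partition(partition_id, name=partition_id)
    except Exception as exc:
        # Still non-fatal, but the failure is now visible at DEBUG level.
        logger.debug(
            "Failed to ensure partition %s (it may already exist): %s",
            partition_id,
            exc,
        )
```

Calling `ensure_partition` twice with the same id stays idempotent; the second call only emits a debug record.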

Comment on lines +187 to +191
valid = {
    o.strip()[1].lower()
    for o in options
    if len(o.strip()) > 2 and o.strip()[0] == "("
}

medium

The manual string slicing o.strip()[1] to extract the option letter is fragile and assumes a strict format. Since _PAREN_LETTER_RE is already defined in this file, it should be used to robustly parse the option letters and avoid potential issues if the option format has leading spaces or other minor variations.

Suggested change
- valid = {
-     o.strip()[1].lower()
-     for o in options
-     if len(o.strip()) > 2 and o.strip()[0] == "("
- }
+ valid = {
+     m.group(1).lower()
+     for o in options
+     if (m := _PAREN_LETTER_RE.match(o.strip()))
+ }
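
The difference between the two approaches shows up on options that start with a parenthesis but are not letter-labelled. The sketch below is illustrative only: the actual definition of `_PAREN_LETTER_RE` is not shown in this review, so the pattern here is an assumed one (a single letter wrapped in parentheses at the start of the string), and the function names are hypothetical.

```python
import re

# Assumed pattern; the real _PAREN_LETTER_RE in the file may differ.
PAREN_LETTER_RE = re.compile(r"^\(([A-Za-z])\)")


def valid_letters_slice(options):
    """Original approach: take the character right after an opening paren."""
    return {
        o.strip()[1].lower()
        for o in options
        if len(o.strip()) > 2 and o.strip()[0] == "("
    }


def valid_letters_regex(options):
    """Suggested approach: accept only a single letter in parentheses."""
    return {
        m.group(1).lower()
        for o in options
        if (m := PAREN_LETTER_RE.match(o.strip()))
    }


options = ["(a) tea", "  (B) coffee", "(see note) water"]
print(sorted(valid_letters_slice(options)))  # ['a', 'b', 's']
print(sorted(valid_letters_regex(options)))  # ['a', 'b']
```

Slicing treats "(see note) water" as option "s", while the regex rejects it because the character after the letter is not a closing parenthesis.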

Comment on lines +231 to +234
chosen_text = next(
    (o for o in options if o.strip()[1:2].lower() == chosen),
    f"({chosen})",
)

medium

Similarly, using o.strip()[1:2] to find the matching option is fragile. Using the existing _PAREN_LETTER_RE regex makes this extraction much more robust and consistent with the rest of the parsing logic.

Suggested change
- chosen_text = next(
-     (o for o in options if o.strip()[1:2].lower() == chosen),
-     f"({chosen})",
- )
+ chosen_text = next(
+     (o for o in options if (m := _PAREN_LETTER_RE.match(o.strip())) and m.group(1).lower() == chosen),
+     f"({chosen})",
+ )

@ch-liuzhide merged commit 0c57bc2 into main on Jun 18, 2026 (18 checks passed)
@ch-liuzhide deleted the eval/personamem-mcq branch on June 18, 2026 at 11:44